Algorithms for the parallel solution of problems are usually designed
assuming an unlimited number of processors. Physical parallel machines
have a fixed number of processors. The algorithm contraction problem
arises when an algorithm requires more processors than are available on
the physical machine. We present tools for comparing algorithm
contractions based on bottleneck communication paths. We apply these
tools to minimum, matrix product and sorting.

Introduction 

Algorithms for parallel computers are usually designed assuming an
unlimited number of processors. For non-shared memory parallel
algorithms, this assumption generally manifests itself by the algorithm
utilizing one processor "per point", or some other input size-dependent
processor allocation. The physical machine has only a fixed number of
processors, of course, which will almost certainly be less than the
number required by the algorithm. In order to make the logical processes
of the algorithm conform to the physical processors of the machine, we
must group processes together into a module to be executed on a single
physical machine. This activity is called contraction[1,5]. The way this
contraction is performed can have a significant effect on performance.

Consider two examples based on an $n \times n$ grid of processes, i.e.
the processes communicate with their four nearest neighbors:

(1) There is much process-to-process communication and
approximately equal computation required of each process.

(2) There is little process-to-process communication and the
amount of computation per process is proportional to its j index,
e.g. process $(i, j)$ iterates j times.

Suppose we have only one fourth the required number of processors and
now compare two ways of forming contractions of four processes per
processor[4]: coalescing groups of adjacent 2x2 subarrays; and folding
groups as if the grid is folded in half and then in half again, i.e.
process $(i, j)$, $1 \le i, j \le n/2$, is associated with
$(i, n-j+1)$, $(n-i+1, j)$ and $(n-i+1, n-j+1)$.

Clearly,  algorithm  (1)  should  be  contracted  by  coalescing  because  the 
process-to-process  communication  for  the  processes  sharing  the  same 
processor  will  become  intraprocessor  communications  (i.e.  fast  memory 
references)  rather  than  slow  interprocessor  communication:  folding 
would  not  be  as  attractive  because  no  communication  is  saved  by  locality. 
Alternatively,  algorithm  (2)  should  be  contracted  by  folding  because  the 
work  is  balanced  since  each  processor  will  perform  a  matching  amount  of 
long  and  short  computations;  coalescing  would  not  be  as  attractive 
because  the  processors  receiving  processes  with  large  indexes  will 
become  a  bottleneck. 
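
The trade-off between the two contractions can be checked numerically. The following is a minimal sketch, not taken from the paper: the function names, the 1-based indexing, and the use of "j units of work for process (i, j)" as the load model of example (2) are assumptions made for illustration. It reports the heaviest processor load and the number of grid edges that end up inside a single processor.

```python
from collections import defaultdict

def coalesce(i, j, n):
    """Processor of process (i, j), 1-based, when adjacent 2x2 subarrays are grouped."""
    return ((i - 1) // 2, (j - 1) // 2)

def fold(i, j, n):
    """Processor of process (i, j) when the grid is folded in half twice."""
    return (min(i, n - i + 1), min(j, n - j + 1))

def evaluate(mapping, n, work):
    """Return (heaviest processor load, number of grid edges internal to a processor)."""
    load = defaultdict(int)
    internal = 0
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            here = mapping(i, j, n)
            load[here] += work(i, j)
            # Count each grid edge once, via the right and lower neighbours.
            for ni, nj in ((i, j + 1), (i + 1, j)):
                if ni <= n and nj <= n and mapping(ni, nj, n) == here:
                    internal += 1
    return max(load.values()), internal

n = 8
# Assumed load model for example (2): process (i, j) performs j units of work.
coalesce_load, coalesce_internal = evaluate(coalesce, n, lambda i, j: j)
fold_load, fold_internal = evaluate(fold, n, lambda i, j: j)
```

Under these assumptions, for n = 8 coalescing keeps 64 grid edges internal against 16 for folding, while folding gives a heaviest load of 18 against 30 for coalescing, which matches the qualitative argument above.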

Using the results of Berman and her colleagues[3], an algorithm can
be automatically contracted, and this seems to be the best approach
when nothing is known about the algorithm. At the other end of the
spectrum, however, the programmer has "complete" knowledge about the
algorithm. How should he be guided when performing his own contraction?
In this paper we develop some apparatus to guide the programmer
who must contract an algorithm. We will provide some case studies of
contraction that show an unexpected diversity and we offer some general
contraction strategies that can find application in other algorithms.
Contraction is a nontrivial problem for parallel programmers[13], and so
a secondary goal here is to expose it as an important topic for study
and a subject suitable for rigorous analysis.

Definitions 

The generic parallel architecture under consideration in this paper is a
non-shared memory model. It is a collection of homogeneous sequential
computers operating asynchronously and connected in a communication
network that is a bounded degree graph[13]. A single "edge" in the graph
provides bidirectional communication between two processors. The
CHiP[11] architecture is an example of this generic architecture.

The method used for programming this model consists of defining a
sequential program for each processor and a communication graph. We
are assuming a configurable architecture. (The problem of mapping a
communications graph onto a different processor connection graph is
discussed by Berman and Snyder[4] and Bokhari[5].) Communication is
explicitly shown in the sequential code by specifying a data value to be
sent to the processor connected by a given edge.

The algorithm contraction problem arises when an algorithm that is
designed for use on n processors must be mapped to a physical parallel
computer with only p < n processors. The programmer must decide which
logical processes are to be mapped to the physical processors. Assuming
that the logical processes have balanced loads (they run for the same
length of time), we would like the physical processors to have balanced
loads. This is done by mapping the same number of logical processes to
each physical processor. The number of logical processes assigned to
two arbitrary physical processors should differ by at most an additive
constant c. For most contractions, it would be best to have c = 1.

The contraction induces a communication graph for the p physical
processors. This new graph is defined by logical processes needing to
communicate with other logical processes not mapped to the same
physical processor. We assume that if a logical process in processor i
needs to communicate with a logical process in processor j, there is a
physical edge connecting the two processors in the new graph. The
contraction may map many of these logical edges to one physical edge in
the new graph. That is, we are allowing only one edge between physical
processors. Under the assumption of a bounded degree graph for the
generic architecture, this induced graph must also be of bounded degree.
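
The induced graph and the balance condition can be sketched as follows. This is an illustrative sketch only: the function names, the dictionary representation of a contraction, and the 8-process ring used as the example are assumptions, not part of the paper.

```python
from collections import Counter

def induced_graph(edges, mapping):
    """Physical edges induced by a contraction.

    edges:   iterable of (u, v) logical edges
    mapping: dict from logical process to physical processor
    """
    physical = set()
    for u, v in edges:
        pu, pv = mapping[u], mapping[v]
        if pu != pv:                           # edges inside one processor vanish
            physical.add(frozenset((pu, pv)))  # at most one edge per processor pair
    return physical

def is_balanced(mapping, c=1):
    """True if the numbers of processes on any two used processors differ by at most c."""
    sizes = Counter(mapping.values())
    return max(sizes.values()) - min(sizes.values()) <= c

def max_degree(physical):
    """Largest number of physical edges incident to one processor."""
    degree = Counter()
    for edge in physical:
        for proc in edge:
            degree[proc] += 1
    return max(degree.values())

# Hypothetical example: a ring of 8 processes contracted pairwise to 4 processors.
ring = [(i, (i + 1) % 8) for i in range(8)]
pairs = {i: i // 2 for i in range(8)}
physical = induced_graph(ring, pairs)
```

In this example the induced graph is again a ring, with 4 physical edges and degree 2. Note that `is_balanced` only inspects processors that received at least one process.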

As an example of contraction, let us assume we have an algorithm
with a tree graph. Consider the contraction to 5 processors shown in
Figure 1. This contraction caused an increase in the degree: for
example, the new root vertex has four descendants. Using this kind of a
contraction, it can be shown that given p processors, contracting an
algorithm with at least $p^2$ logical processes requires degree p-1.
Figure 2a shows a contraction of the tree to 4 processors. An extension
of this method yields a binary tree in the p processors.

Figure 2b gives another contraction to 4 processors. This contraction
is derived by the recursive tree construction given by Leiserson[9].
Given two instances of a tree each with an associated free node, we can
build a new tree and an associated free node. This produces a linear area
layout in the plane with several desirable properties, one of which is the
constant number of external edges.

In this paper, we will be considering the contribution of the
communication time to the performance of the contracted algorithm.
Unless otherwise stated, we assume that a communication between
processing elements costs a fixed time $t_e$. During this time, no other
communication in the same direction may take place. We are specifically
allowing all edges to have simultaneous communication. Communication
internal to a processor costs the fixed time $t_i$. We also assume that
$t_e \gg t_i$.

Figure 1: A 5 processor contraction.

Figure 2: Two 4 processor contractions.
We  would  like  to  develop  tools  for  reasoning  about  the  relative  merits 
of  different  contractions.  This  includes  their  communication  costs  and 
their  execution  times.  To  aid  in  this  objective  we  give  the  following 
definitions. 

Let $A = (V, E)$ be an algorithm where $V$ is a set of logical processes
(vertices and associated programs) with $|V| = n$, and $E$ is a set of
edges $(V_i, V_j)$, $V_i, V_j \in V$.

Let $M(A, p) = B$ be a contraction of algorithm $A$ into algorithm $B$
where $B$ uses $p$ processors and $p < |V_A|$. The contraction $M$ maps
elements of $V_A$ onto $V_B$ such that the number of elements of $V_A$
mapped to an arbitrary element of $V_B$ differs by no more than one from
the number of elements of $V_A$ mapped to any other element of $V_B$.

Let $w(e)$, the weight of $e$, for $e = (V_1, V_2)$, be the larger of the
number of messages from $V_1$ to $V_2$ and the number of messages from
$V_2$ to $V_1$.

Let $K(A) = \max_{e \in E} w(e)$ be the communication "cost" of $A$.
This cost is an estimate of the minimum communication time required for
the algorithm. Due to dependencies, the actual communication cost may
be more.

Let $T(A)$ be the execution time for $A$.
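
The definitions of w(e), K(A) and the contracted algorithm M(A, p) can be made concrete with a small sketch. The representation (a dictionary of directed message counts), the function names, and the four-process example are assumptions chosen for illustration; only the definitions themselves come from the text.

```python
def edge_weights(messages):
    """w(e) for every edge.

    messages: dict (src, dst) -> number of messages sent from src to dst.
    Returns a dict frozenset({u, v}) -> larger of the two directional counts.
    """
    weights = {}
    for (u, v), count in messages.items():
        e = frozenset((u, v))
        weights[e] = max(weights.get(e, 0), count)
    return weights

def cost(messages):
    """K(A): the weight of the busiest edge."""
    return max(edge_weights(messages).values())

def contract(messages, mapping):
    """Directed message counts of the contracted algorithm.

    Messages between processes on the same processor become internal and
    are dropped; all other messages are summed per processor pair.
    """
    out = {}
    for (u, v), count in messages.items():
        pu, pv = mapping[u], mapping[v]
        if pu != pv:
            out[(pu, pv)] = out.get((pu, pv), 0) + count
    return out

# Hypothetical algorithm: a line a-b-c-d with uneven traffic.
line = {("a", "b"): 3, ("b", "a"): 1, ("b", "c"): 2,
        ("c", "d"): 5, ("d", "c"): 5}
neighbours = {"a": 0, "b": 0, "c": 1, "d": 1}   # keeps the busy edges internal
alternate = {"a": 0, "c": 0, "b": 1, "d": 1}    # makes every edge external
```

Here `cost(line)` is 5, `cost(contract(line, neighbours))` is 2 and `cost(contract(line, alternate))` is 8, so the first contraction has the smaller K.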

PROPOSITION 1: For a given $A$, $p$, $M_1$, and $M_2$, and $t_e > t_i$,
if $K(M_1(A, p)) < K(M_2(A, p))$ then $T(M_1(A, p)) \le T(M_2(A, p))$.

This proposition is formalising the notion that the bottleneck edge
will be a lower bound on the time required for the execution of the
mapped algorithm. If the processors have a small amount of computation
relative to the communication, the execution time will depend on the
communication time. The bottleneck edge of the contraction $M_1$ will
require a minimum of $t_e K(M_1(A, p))$ time, which is less than
$t_e K(M_2(A, p))$. With a higher minimum communication time, we cannot
expect $M_2$ to execute in less time than $M_1$. If the processors have
a large amount of computation in ratio to the communication, the
computation time will dominate, yielding near equal times. Even in this
case, $M_1$ uses less time for communication than $M_2$. This
proposition then motivates us to map the busiest edges of an algorithm
to internal edges.

Case  Studies 

We now look at several parallel algorithms and some contractions.
We approach these by considering algorithms with similar communication
graphs. The three graphs considered are the tree, grid, and binary
n-cube.

Tree  algorithms 

There are several algorithms that run on complete binary trees (Figure
1) having similar characteristics, like the aggregation operations of
minimum and global sum. All processors have a value and we want to
compute a global value that depends on all these values. Leaf processors
send their value to their parents. Internal processors take the minimum
(sum) of their own value and their children's values and then send the
result to their parents. The final value will be computed at the root
processor in $O(\log n)$ time. The communication in these algorithms
requires one message over each edge for each global minimum (sum).
For a single minimum we have $K(\mathrm{minimum}) = 1$.
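
The message pattern of the tree minimum can be sketched sequentially. This is an illustration only: storing the complete binary tree in heap order (node i has children 2i+1 and 2i+2) and the function name are assumptions, and the sketch ignores timing and only counts messages per edge.

```python
def tree_minimum(values):
    """Global minimum on a complete binary tree stored in heap order.

    values: one value per node; node i has children 2*i + 1 and 2*i + 2.
    Returns (minimum computed at the root, messages sent over each edge).
    """
    partial = list(values)
    messages = {}
    # Visit children before parents so each node forwards a finished result.
    for child in range(len(values) - 1, 0, -1):
        parent = (child - 1) // 2
        partial[parent] = min(partial[parent], partial[child])
        messages[(child, parent)] = messages.get((child, parent), 0) + 1
    return partial[0], messages

root_value, messages = tree_minimum([7, 3, 9, 4, 8, 1, 6])
busiest = max(messages.values())
```

Every one of the 6 edges of this 7-node tree carries exactly one message, all toward the root, so `busiest` is 1, in agreement with K(minimum) = 1.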

Consider the contraction in Figure 2a. Let us call this contraction
$M_1(\mathrm{minimum}, p)$. Each edge in the original algorithm requires
one message. Each edge in the smaller graph has 4 edges from the original
graph. Since we have only one connection between the physical
processors, we have 4 messages for each edge. For an arbitrary n (size of
original algorithm) and p (the number of processors) we have

$K(M_1(\mathrm{minimum}, p)) = \frac{n}{p}$.


A similar contraction to Figure 2a is touched on by Berman and
Snyder[4]. Figure 3 shows this contraction. This is achieved by
'folding' the tree. As Berman and Snyder notice, this contraction,
$M_2$, has $K(M_2(\mathrm{minimum}, p)) = [...]/p$.

Consider the contraction in Figure 2b. Let us call this contraction
$M_3(\mathrm{minimum}, p)$. We note that each edge in the smaller graph
has at most one edge from the original graph in each direction. For an
arbitrary n and p we have $K(M_3(\mathrm{minimum}, p)) = 1$.

Proposition 1 tells us that since $M_3$ has a smaller K, it is the
preferable contraction. Both $M_1$ and $M_2$ depend on n and p for their
cost. But $M_3$ has a constant cost, regardless of n and p. In fact, this
contraction is optimum for all tree algorithms that have identical edge
weights and unidirectional communication (all toward the root or all
toward the leaves).

We first look for a lower bound. Since the tree is connected, the
physical processors must be connected. This requires at least one
incident edge for each physical processor. The smallest cost
$K(M(A, p))$ would be where a maximum of one logical edge was mapped to a
physical edge. Therefore, $K(M(A, p)) \ge K(A)$, the cost of the original
algorithm.

LEMMA 2: For complete binary tree algorithms with balanced processor
loads, equal edge weights, and unidirectional communication,
algorithm contraction based on Leiserson's binary tree layout technique
yields optimum results.

PROOF: For the mapping $M_3(A, p)$, each processor contains a complete
subtree and an 'extra' node. The extra nodes are used in the tree
above the subtrees contained in the processors. Therefore, there are at
most 4 external connections. Of these four, two edges are used to
receive (send) data from (to) the children of the extra node, and two
edges are used to send (receive) data to (from) the subtree's and the
extra node's parents. Since the root of the subtree and the extra node
are not at the same level in the tree, edges with data flowing in the
same direction cannot be connected to the same physical processor. (It
is possible to have two of these edges over the same physical edge, but
the data moves in opposite directions.) This gives the same weight to
the physical edges as the original edges. Therefore,
$K(M_3(A, p)) = K(A)$, which is the lower bound. □

Notice that this layout technique will place two logical edges in the
same physical edge for some physical edge. For tree algorithms with
bidirectional communication, we then get $K(M_3(A, p)) = 2K(A)$.

To help verify these results, the minimum algorithm was programmed
using the Poker parallel programming environment[12]. Both $M_1$ and
$M_3$ were programmed. Each contraction was timed using 4 and 16 data
items per processor with 4 and 16 processors. The results of these
timings are given in Table 1. Each 'tick' represents a microsecond on
the 64 processor Pringle.

Grid  algorithms 

We next look at algorithms that run on a grid interconnection.
Consider the matrix product algorithm for the Wavefront Array
Processor (WAP)[8]. It uses $n^2$ processors for the $n \times n$ matrix
product $AB = C$. The data is fed in along the top n processors and from
the left n processors. The matrix A is arranged to enter column by
column, starting with the first column. The matrix B is arranged to
enter row by row, starting with the first row. (See Figure 4.) All
processors execute identical procedures. The result, $c_{ij}$, is
initialized to zero. A loop is executed n times that reads an A value
from the left and a B value from above, multiplies them together, and
adds the result to $c_{ij}$. The A and B values are sent to the right
and down, respectively. This causes the upper left processor to be the
first processor to start execution. As the data moves into the array,
there is a wavefront of executing processors on the cross diagonal. Each
edge is used to send all of one row of A or one column of B.
For the WAP algorithm we have $K(\mathrm{WAP}) = n$.
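
The data flow of the WAP product can be sketched as a sequential simulation. This is not the WAP code itself: the function name and the 0-based indexing are assumptions, and the sketch ignores the timing of the wavefront, tracking only which values pass over which grid edge.

```python
def wavefront_product(A, B):
    """Simulate the data flow of the WAP product C = AB on an n x n grid.

    Processor (i, j) sees A[i][k] arriving from the left and B[k][j]
    arriving from above at its k-th loop iteration, accumulates their
    product, and forwards both values to the right and downward.
    Returns (C, messages per directed grid edge).
    """
    n = len(A)
    C = [[0] * n for _ in range(n)]
    sent = {}
    for k in range(n):
        for i in range(n):
            for j in range(n):
                a, b = A[i][k], B[k][j]
                C[i][j] += a * b
                if j + 1 < n:                      # pass the A value right
                    edge = ((i, j), (i, j + 1))
                    sent[edge] = sent.get(edge, 0) + 1
                if i + 1 < n:                      # pass the B value down
                    edge = ((i, j), (i + 1, j))
                    sent[edge] = sent.get(edge, 0) + 1
    return C, sent

C, sent = wavefront_product([[1, 2], [3, 4]], [[5, 6], [7, 8]])
```

In this sketch every grid edge carries exactly n messages, which is the count behind K(WAP) = n.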

Consider the contraction in Figure 5. Let us call this contraction
$M_1(\mathrm{WAP}, p)$. This is the contraction done by cutting the
graph into p equal size connected subgraphs and assigning one process
from each subgraph selected from corresponding positions to a single
processor. The physical connection graph, shown in Figure 5, is a grid
with end around (i.e. toroidal) connections. For each logical process in
a physical processor, there are horizontal and vertical communication
paths. Since we have $n^2/p$ logical processes in a processor, the
number of logical edges using one processor-to-processor connection is
$n^2/p$. Since all horizontal and vertical edges have the same number of
messages, n, we have $K(M_1(\mathrm{WAP}, p)) = \frac{n^3}{p}$.

Minimum: ticks for n (items) on p (processors)
Table 1: Timings of the minimum algorithm.

Figure 4: WAP organisation.

Figure 5: A contraction of 16 logical processes to 4 processors.

Consider the contraction in Figure 6. Let us call this contraction
$M_2(\mathrm{WAP}, p)$. This is the contraction done by cutting the
graph into p equal size connected subgraphs and assigning an entire
subgraph to a processor. We see that only the perimeter processes have
edges that go from processor to processor. Also, notice that no end
around connections are needed. The number of communication paths over
one processor-to-processor connection is $n/\sqrt{p}$. Each
communication path requires n messages giving
$K(M_2(\mathrm{WAP}, p)) = \frac{n^2}{\sqrt{p}}$.

Comparing the two contractions, we see that $K(M_2(\mathrm{WAP}, p))$ is
smaller than $K(M_1(\mathrm{WAP}, p))$ by a factor of $n/\sqrt{p}$.
Proposition 1 tells us that $M_2$ is the better contraction. We
conjecture that $M_2$ is the best contraction that can be achieved for
grid algorithms. The basis for this conjecture is that this contraction
has the smallest perimeter for a given area, and has been commonly used
for contraction in published algorithms, for example for the Jacobi
iterative method[1] and for the conjugate gradient method[6].
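
The two grid contractions can be compared by counting messages directly. The sketch below is illustrative: the function names are hypothetical, p is assumed to be a perfect square greater than 1 whose root divides n, and the "corresponding positions" contraction is interpreted as mapping process (i, j) to processor (i mod √p, j mod √p), which is an assumption about Figure 5 rather than something the text states.

```python
import math

def wap_messages(n):
    """Directed message counts of the WAP product: n messages on every grid edge."""
    msgs = {}
    for i in range(n):
        for j in range(n):
            if j + 1 < n:
                msgs[((i, j), (i, j + 1))] = n   # A values move right
            if i + 1 < n:
                msgs[((i, j), (i + 1, j))] = n   # B values move down
    return msgs

def contraction_cost(msgs, mapping):
    """Messages on the busiest processor-to-processor connection, per direction."""
    traffic = {}
    for (u, v), count in msgs.items():
        pu, pv = mapping(u), mapping(v)
        if pu != pv:
            traffic[(pu, pv)] = traffic.get((pu, pv), 0) + count
    return max(traffic.values())

def wap_costs(n, p):
    """Return (cost of the interleaved contraction, cost of the block contraction)."""
    s = math.isqrt(p)
    assert s * s == p and s > 1 and n % s == 0
    block = n // s
    msgs = wap_messages(n)
    interleaved = contraction_cost(msgs, lambda v: (v[0] % s, v[1] % s))
    blocked = contraction_cost(msgs, lambda v: (v[0] // block, v[1] // block))
    return interleaved, blocked
```

For example, `wap_costs(8, 16)` gives 32 for the interleaved contraction and 16 for the block contraction, and `wap_costs(8, 4)` gives 128 and 32; the ratio in both cases is n/√p.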

Both $M_1$ and $M_2$ were programmed using Poker. Table 2 summarizes
the results of the timings. As predicted, $M_2$ was the faster
contraction, but because the communication time is not the only time
consuming part in these algorithms the difference is perhaps not as
dramatic as might be seen on a larger problem.

Binary n-cube algorithms

We now look at two algorithms for the binary n-cube. The first
algorithm is the divide-and-conquer algorithm for matrix product given by
Nelson[10]. The other algorithm is Batcher's bitonic sorting
algorithm[2].

The matrix product algorithm takes two $n \times n$ matrices, A and B,
and computes their product $C = AB$. A and B are assumed to be in row
major order in the binary n-cube of order 2k, where $k = \log n$. The
algorithm views A and B as a 2x2 matrix of $\frac{n}{2} \times
\frac{n}{2}$ matrices. The 2x2 matrix algorithm is then used to multiply
the submatrices. Figure 7 shows an order 4 cube laid out in the plane
using the CHiP architecture. The numbers in the boxes show the index of
the matrix elements initially contained in that processor. We are
assuming that the processors are numbered in row major order. The dotted
boxes show cubes of order 2. These cubes, which generally have order
2(k-1), contain an $\frac{n}{2} \times \frac{n}{2}$ submatrix of both A
and B. Note that these cubes are constructed by "removing" the edges of
order k and 2k, where an edge of order k connects processors that are
$2^{k-1}$ distance apart.

To compute the 2x2 matrix product, all processors exchange values
of B on the order 2k edge and values of A on the order k edge. After the
exchange, each cube of order 2(k-1) contains 4 submatrices of size
$\frac{n}{2} \times \frac{n}{2}$. This is all the data that is required
for each cube of order 2(k-1) to compute its part of the 2x2 matrix
product independently. If the submatrix is not a single element, two
matrix products of $\frac{n}{2} \times \frac{n}{2}$ matrices are
required. These matrix products are done using the same algorithm.
Matrix addition is done element by element. Because corresponding
elements of the matrices are contained in the same processor, no
communication is required.

To find the cost of this cube matrix multiply, $K(\mathrm{CMM})$, we
need to find the edge with the most messages. At the first level of
recursion, the order k and 2k edges were used to send a message each way.
This is the only use of these edges in the algorithm. Therefore,
$w(e) = 1$, where e is an order k or 2k edge. At the second level of
recursion, two matrix products are computed using the order k-1 and 2k-1
edges. Each matrix product sends one message each way on each edge giving
$w(e) = 2$, where e is an order k-1 or 2k-1 edge. At level l of the
recursion, $w(e) = 2^{l-1}$ messages over the order k-(l-1) and 2k-(l-1)
edges. The recursion stops when we have order 2 cubes. This is at the
log n level of recursion. There are $\frac{n}{2}$ matrix multiplies done
by order 2 cubes. These order 2 cubes use the order 1 and k+1 edges. Each
matrix multiply sends 1 message each way giving $w(e) = \frac{n}{2}$,
where e is an order 1 or k+1 edge. Since this is the largest value,
$K(\mathrm{CMM}) = \frac{n}{2}$.

Figure 6: Another contraction of 16 logical processes to 4 processors.

WAP matrix multiply: ticks for n (items) on p (processors)
Contraction   16 on 4   64 on 16      64 on 4   256 on 16
M_1           48854     [illegible]   400452    [illegible]
M_2           51113     73088         221545    [illegible]

Table 2: Timings of the WAP matrix multiply algorithm.

Figure 7: An order 4 binary n-cube.
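
The per-level edge weights described above can be tabulated with a short sketch. The function name and the dictionary keyed by edge order are assumptions for illustration; the weights themselves follow the level-by-level count in the text.

```python
def cmm_edge_weights(k):
    """Messages per edge order for the cube matrix multiply, n = 2**k.

    Level l of the recursion (l = 1..k) runs 2**(l-1) matrix products,
    each sending one message each way over the edges of order
    k-(l-1) and 2k-(l-1).
    """
    weights = {}
    for level in range(1, k + 1):
        w = 2 ** (level - 1)
        weights[k - (level - 1)] = w
        weights[2 * k - (level - 1)] = w
    return weights

k = 3                      # n = 8, cube of order 6
weights = cmm_edge_weights(k)
n = 2 ** k
```

For k = 3 the weights are 1 on orders 3 and 6, 2 on orders 2 and 5, and 4 on orders 1 and 4, so the busiest edges are those of order 1 and k+1 with n/2 messages.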

Consider any contraction $M(\mathrm{CMM}, p)$ where $p = 2^m$ for some
$m \le 2k$. $M(\mathrm{CMM}, p)$ will map $\frac{n^2}{p}$ logical
processes to every processor. This allows us to put a cube of order
$\log \frac{n^2}{p} = 2k - m$ into each processor. The
processor-to-processor connection graph is also a cube and is of order
m. Each processor-to-processor connection supports $\frac{n^2}{p}$
communication paths in the original graph. The real question is which
sub-cube do we map to each processor. The cost of the contraction,
$K(M(\mathrm{CMM}, p))$, will be $\frac{n^2}{p}$ times the maximum
$w(e)$, where e is mapped to a physical edge. If e is order 1 or k+1
from the original cube, $K(M(\mathrm{CMM}, p)) = \frac{n^3}{2p}$.

Consider the contraction that maps the edges of order 1 through
$\lceil \frac{2k-m}{2} \rceil$ and order k+1 through
$k + \lfloor \frac{2k-m}{2} \rfloor$ into internal edges. This makes the
edge of order $k + \lfloor \frac{2k-m}{2} \rfloor + 1$ the edge with the
most messages. This edge is used by level
$k - \lfloor \frac{2k-m}{2} \rfloor$ of the recursion. From before we
know that $w(e) = 2^{k - \lfloor \frac{2k-m}{2} \rfloor - 1}$. Therefore
$K(M(\mathrm{CMM}, p)) = \frac{n^2 \sqrt{p}}{2p}$. Clearly, this
contraction is better in terms of the number of messages over the busiest
physical edge than any contraction that does not keep the high traffic
logical edges internal to a processor.
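
The effect of choosing which edge orders become internal can be checked with a small calculation. This sketch rests on assumptions: the function names are hypothetical, the weight of an order-o edge is taken as 2^(k-o) for o ≤ k (and the same for order o+k) from the level-by-level count, and the split of the 2k-m internal orders into a ceiling half on the low orders and a floor half on the orders above k is one reading of the text. It needs 1 ≤ m ≤ 2k so that at least one edge order stays external.

```python
def cmm_weight(order, k):
    """Messages on an edge of the given order (orders o and o + k behave alike)."""
    low = order if order <= k else order - k
    return 2 ** (k - low)

def cmm_contraction_cost(k, m, internal):
    """Busiest physical connection when the given edge orders are internal."""
    n, p = 2 ** k, 2 ** m
    assert len(internal) == 2 * k - m      # each processor holds a (2k-m)-cube
    external = [o for o in range(1, 2 * k + 1) if o not in internal]
    # n*n/p logical edges of each external order share one physical connection.
    return (n * n // p) * max(cmm_weight(o, k) for o in external)

def busiest_internal(k, m):
    """Keep the busiest orders internal: 1..ceil(x) and k+1..k+floor(x), x = (2k-m)/2."""
    size = 2 * k - m
    high = size // 2
    low = size - high
    return set(range(1, low + 1)) | set(range(k + 1, k + high + 1))

k, m = 3, 2                                  # n = 8, p = 4
good = cmm_contraction_cost(k, m, busiest_internal(k, m))
poor = cmm_contraction_cost(k, m, {2, 3, 5, 6})   # leaves orders 1 and k+1 external
```

With n = 8 and p = 4 this gives 16 for the contraction that keeps the busiest edges internal and 64 when the order 1 and k+1 edges stay external, matching $n^2\sqrt{p}/(2p)$ and $n^3/(2p)$ respectively for even m.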


By contrast, let us consider the Batcher bitonic merge sort. This sort
runs on an order k cube to sort $n = 2^k$ elements. The final sorting
will have the smallest element in the first processor and the largest
element in the last processor. Figure 8 shows a graphical representation
of the algorithm. The arrows represent a data exchange and a compare,
leaving the larger number at the end with the arrow and the smaller at
the other end. It is obvious from the figure that the order 1 edge has
the most messages. Therefore, $K(\mathrm{SORT}) = \log n$.
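
The claim about the order 1 edge can be checked by simulating the standard bitonic sorting network and counting how often each cube dimension is used. This is a sequential sketch with hypothetical names, assuming one value per processor and one message each way per compare-exchange; it is not the Poker implementation.

```python
def bitonic_sort(values):
    """Bitonic sort of n = 2**k values, one per processor of an order k cube.

    Returns (sorted list, uses) where uses[d] is the number of
    compare-exchange steps over edges of order d. Each step sends one
    message each way, so uses[d] is also the weight of an order d edge.
    """
    n = len(values)
    k = n.bit_length() - 1
    assert n == 1 << k, "the number of values must be a power of two"
    a = list(values)
    uses = {d: 0 for d in range(1, k + 1)}
    for stage in range(1, k + 1):            # merge bitonic runs of length 2**stage
        for d in range(stage, 0, -1):        # exchange over orders stage, ..., 1
            uses[d] += 1
            bit = 1 << (d - 1)
            for i in range(n):
                partner = i ^ bit
                if partner > i:
                    ascending = (i & (1 << stage)) == 0
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
    return a, uses

data = [9, 3, 14, 0, 7, 12, 5, 1, 15, 8, 2, 11, 6, 13, 4, 10]
result, uses = bitonic_sort(data)
```

For these 16 values the order 1 edges are used 4 times, order 2 three times, order 3 twice and order 4 once, so the busiest edge carries log n messages.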

Again, to contract this algorithm, we see that we want to assign a
sub-cube into a processor. Consider the contraction
$M(\mathrm{SORT}, p)$ where the edges of order 1 through order log p are
mapped to internal edges. We are assuming that $p = 2^m$, for some
$m \le \log n$. This contraction assigns the busiest logical edges to be
internal edges. These edges carry log n - log p messages. Since each
processor contains $\frac{n}{p}$ logical processors,
$K(M(\mathrm{SORT}, p)) = \frac{n(\log n - \log p)}{p}$. Any contraction
that does not map these first log p edges to internal edges will have a
higher communication cost. These results agree with and explain the
results of Hsiao[7], even though his final algorithm was embedded in a
grid instead of another cube.

In comparing the contractions for matrix multiply and Batcher's sort
we see that the same size cube is mapped in a different way when
mapped to the same number of processors. The busiest edges are
different for the two algorithms; thus, the contractions are different.

Conclusion

The algorithm contraction problem is an important problem for
parallel programmers. The way in which an algorithm is contracted can
have a significant effect on performance. Processor-to-processor
communication can be used as a lower bound on the execution time for an
algorithm. It is the processor-to-processor communication that is
affected by different contractions.

We have looked at algorithms for the tree, grid, and binary n-cube
interconnections. For each algorithm we have compared possible
contractions of these algorithms. For trees, we proved that Leiserson's
layout technique was the best for contracting tree algorithms such as
minimum and sums. For grid algorithms, we conjectured that coalescing
by maximizing the area for a given perimeter is optimal for the
algorithms with balanced edge loadings. Finally, we showed two
algorithms for binary n-cubes that required different contractions to
produce the optimal results for the algorithm.

Figure 8: Batcher's bitonic merge sort.


